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Abstract 

This is a review of a set of recent papers with some new data added. After a 
brief biological introduction a visualization scheme of the string composition of long 
DNA sequences, in particular, of bacterial complete genomes, will be described. 
This scheme leads to a class of self-similar and self-overlapping fractals in the limit 
of infinitely long constituent strings. The calculation of their exact dimensions and 
the counting of true and redundant avoided strings at different string lengths turn 
out to be one and the same problem. We give exact solution of the problem using 
two independent methods: the Goulden-Jackson cluster method in combinatorics 
and the method of formal language theory. 



*On leave from the Institute of Theoretical Physics, Academia Sinica, Beijing 100080, China. E-mail: 
hao@itp.ac.cn 



The new paradigm, now emerging, is that all "genes" will be known (in the 
sense of being resident in databases available electronically) , and that the start- 
ing point of a biological investigation will be theoretical. 



- Walter Gilbert (1991) 

The above statement made by a biologist wet-experimentalist, the 1980 Nobel laureate 
Walter Gilbert in a Nature column entitled "Towards a paradigm shift in biology" sounds 
very encouraging for us physicists-theorists. Indeed, the rapid accumulation of huge 
amount of biological data, in the first place, the DNA and protein sequence data, makes 
it clear that further breakthrough in understanding living matter and life phenomenon 
would be impossible without an interdisciplinary effort of scientists of all walks. The 
horizon is so broad that physicists with any background may quickly find some point 
to cut in. With our experience in symbolic dynamicsa we naturally choose the long 
symbolic sequences of DNA to start with. 

1 Introduction 

The genetic information of all organisms except for so-called RNA-viruses is encoded in 
thier DNA sequences. A DNA sequence is a long unbranched polymer made of four differ- 
ent kinds of monomers -- nucleotides. As long as the encoded information is concerned 
we can ignore the fact that DNA exists as a double helix of two "conjugated" strands and 
treat it as a one-dimensional symbolic sequence made of four letters a, c, g, and t, repre- 
senting the nucleotides adenine, cytosine, guanine, and thymine, respectively. Since the 
first complete genome of a free-living organism, Mycoplasma genitalium, was sequenced 
in 1995 the number of available complete genomes has been growing steadily. As of 15 
October 1999 there are in total 4 864 570 sequences containing 3 841 163 Oil letters in 
the GenBanWA . Among these sequences there are more and more complete genomes, 
including 23 bacteria and a few eukaryotes. 

The availability of complete genomes of organisms allows one to ask many questions 
of global nature. For example, a biochemist might look at all enzymes that catalyze the 
thousands of biochemical reactions in a cell that make life going and to infer the whole 
network of metabolic pathways. Perhaps the simplest global question one can imagine 
consists in whether there exist short strings made of the four letters that do not appear in 
a genome. First of all, this is a question that can be asked only nowadays when complete 
genomes are at our hands, as it does not make sense when dealing with small pieces of 
DNA segments. Secondly, as it will become clearer when we introduce some notions from 
language theory, there is a deeper reason to ask this question since in a sense a complete 
genome defines a language which is entirely specified by a minimal set of "forbidden 
words" . 

The visualization scheme of the string composition of long DNA sequences described 
early in [[J inspires a few neat mathematical problems which can be solved precisely by 
using at least two different approaches. Brief accounts of these solutions are scheduled 
to appear in two conference proceedingsQ' El . The data collected in Tables [l] and § are 
presented for the first time. As language theory approach and the combinatorial technique 
used in the work may be quite instructive for other problems we think it appropriate to 
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Figure 1: Allocation of counters for string length K = 1, 2, and 3. 

present them in more details in this review in order to enable more physicists to make 
acquaintance with these methods. 

2 The Visualization Scheme and 
Self- Overlapping Fractals 

Given a bacterial complete genome of length N, i.e., a linear or circular DNA sequence 
made of N letters from the alphabet E = {a, c, g, t}, we are interested in the frequency of 
appearance of various strings of length K. There are 4 K possible different ^-strings so 
we need that many counters to do the counting. We display the counters in a fixed-size 
square frame on a computer screen. The frames for K — 1, 2, and 3 would look like what 
are shown in Fig. [|. 

If we present the K = 1 frame as a 2 x 2 matrix 



M 



9 c 
a t 



then the K = 2 frame is just a direct product of two copies of M: 

M (2) =M®M -- 



gg gc eg cc 

ga gt ca ct 

ag ac ta tc 

aa at ta tt 



In general, a i^-frame is given by 

M (K) =M®M®---®M, 
whose element is expressed via the elements of the 2x2 matrices as 



M {K) 



M hjl M i2h 



M 



i>k3k- 



In order to facilitate the computation, it is better to use binary indices for the matrix M, 
i.e., let 

Moo = 9, Mm. = c, M 10 = a, M u = t. 

The indices (ziji) • • • (ikJk) follow from the input sequences 



S1S2S3 • • ■ s K s K+1 • • • 



By sliding a window of width K along the genome we get N or N — K + 1 total 
counts for a circular or linear sequence. Every segment of length K in the input sequence, 
taken as a number in base 4, points to the array element of its own counter. In order to 
implement this we introduce a mapping 

a: {g, c, a, t} ^ {00, 01, 10, 11} 

for each letter in the input sequence. For the first iT-string S1S2 ■ ■ ■ Sk of the input 
sequence we get a number 

K-\ 

index = £ 4 i ^~ l ~ 1 a(s i ), 

i=0 

which is nothing but the index used to locate its counter. In order to get the new index 
index' for the next X-string, it is enough to discard the contribution of the first letter 
in the previous string and take into account the next new letter. This is easily done by 
using binary operations: 

index' = 4 x (index(mod A K ~ V )) + a(sx+i)- 

We display the A K counters as a 2 K x 2 K square on the screen. The counter for the 
first ^-string is centered at (x, y) : 



x 



Y / 2 K -*- 1 (a(s l )kE), 



i=0 
K-\ 



y = £ 2*"*" 1 («(*) >> i), 



i=0 



where &zE means logical and with the base-4 unit E = 01 and > > 1 means left shift by 
one. Again, for the location (x', y') of the next fC-string one needs only to correct for the 
new input letter: 

x' = 2 x (x{mod 2 K - 1 )) + a{s K+ i)kE, 
y' = 2 x (y(mod 2 K ~ 1 )) + a(s K+1 ) > > 1. 

We note that this leads to a counting algorithm that depends only on the total length N 
of the genome but not on the string length K. This saves some computer time when K 
gets large. 

Applying the above algorithm to the K = 8 strings in the 4 693 221-letter long genome 
of E. coli, we get the picture shown in Fig. [[].[] 

We have used a very crude color code of 16 colors, including black and white. As our at- 
tention is concentrated on those strings that do not appear or that are under-represented, 
we allocate most of the bright colors to small counts with white color representing avoided 
strings. This is a kind of coarse-graining which makes some features of the figure more 
prominent. In particular, the presence of some seemingly regular patterns in Fig. [I] may be 
understood as caused by under-representation of strings that contain ctag as a substring. 
In Fig. |l] we show the counting frames for K — 6, 7, 8, and 9 in which the locations 
of strings that contain ctag, or in short, ctag-tagged strings, are marked with a small 



lr riie reader may download the original Figs, [l] and |] or the PostScript file of this preprint to see 
colors. 




Figure 2: Frequency of 8-strings in the complete genome of E. coli. The characteristic 
patterns are caused primarily by the under-representation of ctag-tagged strings. 



rhombic. We see that the basic features remain unchanged while more and more fine 
patterns appear with K increasing. The most clearly seen patterns in the E. coli portrait 
are indeed given by these ctag-tagged strings. 

Fig. [J] is to be compared with the "portrait" of a sequence (not shown), obtained by 
randomizing the E. coli genome, i.e., a sequence with the same number of nucleotides of 
each kind but with their positions shuffled at random. In such a figure all the characteristic 
patterns disappear, only some hardly perceptible contrast due to the c + g to a + t ratio 
not being equal may be noticed under a careful scrutiny. 

E. coli is not the only bacterium that does not like the ctag substring. Now 9 bacteria 
are known to have a tendency of having under-represented ctog-tagged strings. Other 
bacteria may avoid some other substrings and some may not show any apparent patterns of 
avoided substrings. For example, Fig. [| shows the "portrait" of Methanococcus jannaschii. 
Using templates of various tetranucletides similar to those shown in Fig. [I], one can identify 
at least five sets of under- represented strings tagged by ctag, cgcg, gcgc, gtac, and gate. 

A summary of what has been seen in "portraits" of all available bacterial complete 
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Figure 3: Templates of cta(7-tagged strings in the X = 6, 7, 8, and 9 frames. 



genomes is given in Table [lf| The fact that most of the under-represented tetranucleotides 
are palindromes, i.e., words that happen to be the same when read in both direct and 
reversed directions with the Watson-Crick conjugation being performed at reverse reading, 
may hints on their relation with the recognition sites of some restriction enzymes. This 
has been known to the biologists for some time, see, e.g., ||. Our observation shows its 
a quite common phenomenon in many bacterial complete genomes. 

It is appropriate to mention the relation of the above visualization scheme to the 
"chaos game representation" (CGRW) of DNA sequences. In CGR the final picture can 
only be drawn in black/white and may look quite similar to what one would obtain in 
the above visualization scheme after xeroxing the color figures on a black/white copying 
machine. There are, however, several essential differences. First, the resolution is not 
entirely under control in CGR, as different neighboring nucleotides may be resolved to 
a different precision, depending, say, on the direction of the line joining the nucleotides. 
Our method works at a fixed resolution -- the string length. Second, the algorithm of 



2 The abbreviations of bacterial names are those of the corresponding subdirectory names in GenBank, 
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Figure 4: Frequency of 8-strings in the complete genome of Methanococcus jannaschii. 
One can identify at least five sets of under-represented strings tagged by ctag, cgcg, gcgc, 
gtac, and gate. 



CGR looks a bit more complicated: put a, c, g, and t at the four corners of a square; 
staring from the center of the square plot the middle point of the straight line connecting 
two consecutive nucleotides one by one. The results turn out to be much the same as 
simple counting with fixed string length. Third, if one wish to introduce color in order to 
add more information one should calculate the density of points in CGR — an operation 
that requires big memory and that cannot be realized in a single pass. Therefore, it seems 
to us that the proposed visualization scheme makes CGR obsolete. 

3 Fractals Derived from Bacterial "Portraits" 

In genomes of organisms there are no fractals in the rigorous mathematical sense. How- 
ever, in our visualization scheme fractals may be well defined in the non-biological K — > oo 
limit. These fractals may have some suggestion in the portraits of genomes of real or- 
ganisms. Just look at the templates shown in Fig. |], one naturally sees what left in the 
original framework after deleting all small squares at finer and finer scales that represent 
all possible ctag-tagged strings does lead to a fractal. What is the fractal dimension of the 
complementary pattern defined by one or more given tags? This is not a trivial question 
as besides obvious self-similarity one has to deal with self-overlappings of the excluded 
patterns at different levels. 

Let us look at two simple examples. 

The first example is the case of a one- letter tag, e.g., g-tagged strings. Denote by a K 
the number of strings of length K that do not contain the letter g. At the zeroth level the 
linear size is So = 1, that is the size of the whole square. Since there is only one empty 
string which by definition does not contain g we have clq = 1. At the next K = 1 level, the 
linear size is 8\ = 1/2 and among the four squares of that size three do not contain g, see 
the leftmost square in Fig. [I]. Therefore, we have a^ = 3. In general, we have 8k = \f2 K 
and ax = 3 K ■ The fractal dimension is 

D = - l im l W« = !5i3. (1) 

k^oo log 8 K log 2 

In this simple example, we might have defined a trivial recursion relation for ax, namely, 

a = 1, 
aK = 3ok-i- 

Using the recursion relation one may derive a generating function f(s) for all a^-: 

oo -i 

/0) = 2 a K sK = T3V' 

K=0 l 6S 

where s is an auxiliary variable. In fact, one- letter-tagged strings exclude the largest 
number of /{"-strings, leaving a set of strings over an alphabet of three letters. This is 
the meaning of aK = 3 K and this tells us that for any possible tags the dimensions are 
included in between the limits: 

|^ < Aa g < 2. 

log 2 - tag - 

Next, look at cg-tagged strings. We first note that it is an known fact that in many 
human genes the dinucleotide eg is less represented than, e.g., the dinucleotide gc. This 

8 




Figure 5: A template for gc-tagged strings showing the overlaps at different lavels. 

leads to a characteristic pattern in the portrait of the DNA sequence that contain the 
gene. As seen from the template for the cg-tag, shown in Fig. |5|, the exclusion starts at 
the level K = 2: among the 16 possible dinucleotides only eg is avoided. At K = 3 level, 
among the 64 trinucleotides the four combinations xcg, x = {a, c, g, t} are excluded in 
addition to the four cgx,x = {a,c,g,t} which have already been excluded at the K = 2 
level. So far, no overlap of exclusions has taken place. However, at the next K = 4 level, 
one of the 16 xycg type squares, where x, y = {a, c, g, £}, namely, cgcg, is immersed in the 
K = 2 excluded square and should not be doubly counted. There are 8 such overlaps at 
K = 5, 47 at K = 6 (not shown in Fig. [5]), etc. The question is how to take into account 
these overlaps automatically. Suppose we know how to calculate the generating function 



f(s) = Y, &ks 

K=0 



K 



(2) 



then the fractal dimension is given by 

D = 



\oga K logaj^ 
hm — = hm — , 

K^oo log Ok K^oo log 2 



(3) 



where we have used the fact that 5k = 1/2 K . According to the Cauchy criterion the 
radius of convergence of the series (^j defining the generating function is determined by 



lim <2k 



1 

3>K = ' 

where Sq being the minimal module zero of / _1 (s). Thus if we know the generating 
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Table 1: Under- represented tetranucleotides seen in the bacterial genomes. 



function, the fractal dimension is given by 



D 



log I so I 
' log 2 



(4) 



Therefore, the problem of calculating the fractal dimensions reduces to that of finding 
the generating functions. This will be treated in Sections [| and || by using two different 
methods. 

We shall see that for the cg-tagged strings the generating function is 

f^ = 1 A i 2 ' 

1 — As + s z 
see Table ||. Consequently, s = 1/2 — a/3 and D = 1.8999686. 

4 Number of True and Redundant Avoided 
Strings by Direct Counting 

Once we know that there might be avoided and under-represented strings from the vi- 
sualization scheme, we can perform a direct identification of avoided strings. The direct 
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counting has the merit that the string length K is not seriously limited by the screen reso- 
lution. While the maximal K is 9 without scrolling the figure behind the screen, in direct 
counting one can go to longer K. In addition, direct counting does not miss any avoided 
strings while naked-eyes could only notice the most prominent ones. We show some of the 
results of direct counting in Table [3[]. It is a remarkable fact that the first avoided strings 
appear at length K = 6, 7, or 8 in all bacterial genomes, while statistically significant 
avoidance can only occur at much longer length in a random sequence. 

The direct counting poses another question, namely, how to count the number of true 
and redundant avoided strings. For example, in the genome of E. coli the first avoided 
string gcctagg is identified at K = 7 in contrast to a random sequence of the same length 
and nucleotide composition which would have each type of 7-strings appearing about 283 
times. At the next length K = 8 a total of 173 strings are found absent. However, 
among these 173 strings 8 must be the consequence of the lacking of gcctagg. Thus there 
are 165 true avoided strings at K = 8. Among the 5595 avoided 9-strings 48 are the 
consequence of gcctagg being absent, 1166 are redundant being the consequence of the 
165 true avoided 8-strings, only 4381 are true avoided ones at K = 9. Among these 4381 
strings 2041 do contain the palindromic tetranulcleotide dag. At K — 10 there are 114808 
true avoided strings among the total of 150409, while 256, 6531, and 28814 are redundant 
strings caused by the absence of true avoided strings at length 7, 8, and 9. How to count 
the number of redundant strings at each Kl A simple-minded estimate shows that a true 
avoided i^-string takes away 

E(i)=A i (i + l) (5) 

(K + z)-strings. We list the first E(i) below for later comparison: 
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12 
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4 
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E(i) 


1 8 48 


256 


1280 


6144 


28672 


131072 



This is obtained as follows. At the K+l level one can add one letter from the alphabet 
either in front or at the end of the avoided i^-string, thus there are 4+4 redundant avoided 
strings at length K+l. At the next length K + 2 there are three ways to add 2 letters to 
the avoided i^-string to get avoided (K + 2)-strings, each way having 4x4 combinations 
of letters. Continuation of the argument leads to Eq. [5[ However, this is usually an over- 
estimation, as it does not take into account the overlaps of letters at the begining and the 
end of a string. A simple counter-example being the 4-string gggg: there are only 7 new 
5-strings as adding a g to the head or the tail yields the same string ggggg. 

A little reflection shows that the calculation of the generating function for given tags 
and the counting of the true and redundant avoided strings are one and the same problem. 
Indeed, both problems need to take into account the overlap of substrings in making 
longer strings. The fractals provide a geometric representation of the problem as each 
small square corresponds to a well-defined type of if-string. 

5 Combinatorial Solution 

We first formulate the problem in terms of combinatorics. Let £ be an alphabet, e.g., 
S = {a, c, g, t}. Denote by E* the set of all possible finite strings made of letters from the 

3 Detailed results on avoided strings by direct counting will be published elsewhercH . 
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Bacteria Kq Nk First Avoided Strings 



gCCTAGG 

aCGCGCG 

CCTAGGg tacCTAG 

GTCGAC 

GTCGAC TCGAca 

GCGCGC GTCGAC CGATCG 

TATAatg tatgtta taaaata 

GCGCGCg CGCGCGa tGCGCGC 

GCGCGCg GCGCGCc cGCGCGC tGCGCGC 

GCGCGCg cGCGCGC gcaCTAG cACTAGT 

GCGCgta tGCGCcg ccgtgcg cgtgcga 

ggacCTAG cTCGAccc gcgaccta cgtagggg 

gCTAGtc acgCTAG tCTAGcg gCGCGCG 

aCGCGCG 

cCGaCGa cgtaggc cgatagg GCCGTCg 

aGGGCCC acgaggg taGGCCg 

CTAGtag CTAGtat gACTAGT catacta tacacta 

tagttag taagtgg ttagtaa tatttag ttattta 

gGCCGGC GCCGGCc cggCCGG CCGGggg 

CCCGGGg GGGaCCC gGGtCCg GGGtCCC 

GGaCCcg gGTCGAC GTCGACg tGTCGAC 

GGCCgg GGCCtc tcGGCC cgGCGC ccGGCC 

cCCGGc CGCGCG gccgtc ggacgc ggtcgg 

cctcgg ctcgga tcggcg tccgag 

36 contain GCGC, CGCG, GGCC, CCGG 

54 contain CTAG, 15 contain AGCT 

30 contain AATT 

96 contain GCGC, CGCG, GGCC, CCGG 

264 contain GCGC, CGCG, GGCC, CCGG 
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7 


10 
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7 


12 



Mgen 



14 



Rpxx 


7 


71 


Tpal 


8 


118 


Aero 


8 


137 


Bbur 


7 


232 


Ctra 


8 


562 



Table 2: The first avoided strings in bacterial complete genomes by direct counting. Kq 
is the minimal string length at which the first avoided strings are identified. Nk is the 
number of avoided strings at length Kq. Palindromic substrings are capitalized. 



12 



alphabet E, including the empty string. Given a set B e E* of "bad" words that we wish 
to avoid in all words we are going to use. Let A e E* be the set of all "clean" words that 
do not contain any member of B as substrings. Denote by clk the number of clean words 
of length K. 

Problem: Given E*, B, calculate clk or even better calculate the generating func- 
tion (Q) that gives a^ for all K. 

5.1 The Goulden- Jackson Cluster Method 

In combinatorics there exists a powerfull method to deal with this kind of problems — the 
Goulden- Jackson cluster method's. This method has been well-described by Noonan an 



ZeilbergerliiJ . However, we explain its basic idea and derivation in our specific context. 
First, we assign a weight to each word u: it is an auxiliary variable s raised to the power 
\u\ where \u>\ is the length of the word u>: 

weight{uj) = s' w '. 

If we can calculate the sum of weights over all clean words and reorder the terms 
according to the word length: 

oo 

f( s ) = E weight(u) = ^T a K s K , 

w&A K=0 

our task would be accomplished. Let us extend the summation over clean words to that 
over all words 

E^E 

and at the same time multiply each weight{u) by a zero raised to the power of the number 
of "bad" factors in u: 

weight{uj) =► weight^) x number of factors of u thateB , 

where by definition 

0° = 1, 
m = 0, m > 1. 

Now let us manipulate the power of zero. Suppose we have a set of 3 objects, say, 
S = {a l5 a 2 , a 3 } and we multiply three zeros n^es 0- We reorganize the elements of S into 
subsets: 

{<Ji} = {e; ax, a 2 , a 3 ; aia 2 , a 2 a 3 , a 3 ai; axa 2 a 3 }, 

where e denotes an empty subset. There are 2 3 = 8 subsets. The product of three zeros 
may be rewritten as a sum over these 8 subsets: 

no= n[o+(- i )] = E(- i ) H 5 

dies cii£S {cri} 

where |er| is the cardinality of the subset <Tj, i.e., the number of elements in a- h . This is a 
particular case of so-called Sylvester principle of inclusion-exclusion. 
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Now we can write 

/(*)= E E (-i) w * M , 

where Bad(u) denotes the set of bad factors of u>. In fact, we have got a new counting 
problem for a collection of new subjects (to, a) with a new weight function (— l)' "'^'. 
These (uj,a) may be called tagged words, i.e., a word cu tagged by a factor a G Bad(uS). 
Note that a tag a may be a combination of none or several bad factors of u. When the 
tag is empty, a = e, the word is clean. 

Denote the set of all tagged words as M. = {(u>, a)}. The weight of set M. remains 
f(s). Without loss of generality we can examine all words in Ai starting from their right 
end. The set M. contains an empty word. There are words in Ai that contain a single 
letter from the alphabet that does not form a part of any member of B. There are words 
in M. that contain a cluster of bad members from B. Thus in set-theoretical notation we 
may write 

M = {empty word} n MY, n MC, 

where C denotes clusters of bad words. 

Written in terms of weight functions, we have 



f(s)=l + f(s)ds + f(s)weight(C). 



Therefore, we have 



q — ds — weight(C) 

In the above formulas d = |E| is the cardinality of the alphabet E. In our case of 
nucleotides d = 4. When the set B is empty, i.e., no bad words at all, we have the trivial 
result 

This is just a pedantic way to say that there are 4 K words of length K. 

When the set B contains only one word u that cannot make clusters with itself, e.g., 
u = get, one simply has weightiC) = s' u ' and the problem is solved: 

When the bad word can make clusters with itself, e.g., u = gcg and a cluster being 
gcgcg, the situation is more complex and requires the technique described in the next 
subsection. Anticipating a few such results, we list all possible single-tag generating 
functions in Table |3| up to tag length K = 4. 

A related question is the number G(n) of different types of generating functions for 
a given tag length n. These numbers turn out to be independent upon the size of the 
alphabet E as long as there are more than two letters in Eliil: 

n 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
G(n) 12 3 4 6 8 10 13 17 21 27 30 37 47 

In fact, these G(ri) are so-called correlations of n as given by the integer sequence 
M0555in |2f, see also [||. 
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Tag 
9 

gc 

99 

get 

gcg 



m 





l-3s 




1 


1 


-4s+s 2 




1+s 


1- 


-3s-3s 2 




1 


1 


-4s+s 3 




1+s 2 



l-4s+s 2 -3s 3 



D 

log 3 
log 2 

1.89997 

1.92269 

1.97652 

1.978 



Tag 
999 

ctag 

ggcg 
gege 

gggg 






l-3s-3s 2 -3s 3 
1 

l-4s+s 4 

1+s 3 
l-4s+s 3 -3s 4 

1+s 2 
l-4s+s 2 -4s 3 +s 4 

l+s+s 2 +s 3 
l-3s-3s 2 ~3s 3 -3s 4 



D 

1.98235 

1.99429 
1.99438 
1.99463 

1.99572 



Table 3: Generating function and dimension for some single tags. 

Applying the Goulden- Jackson cluster method to the case of only one "bad word" 
gectagg in the case of E. coli leads to the following generating function: 



m 



l + s e 



l-4s + s 6 -3s 7 ' 



The number of redundant avoided strings are obtained by subtracting the above f(s) from 
that of the "no-bad- words" case (|7]): 



1 



13 



l-4s 



f(s) = s 7 + 8/ + 48s 9 + 256s lu + 1280s 11 + 6144s 12 + 28671s" + 131063s 14 + 



These coefficients are to be compared with the naive estimates given below Eq. fl5|) As 
expected, the deviation appears from the term s 13 . 

5.2 Weight Function for Clusters 

In order to continue with the full representation of the Goulden- Jackson method we 
take the newly published complete genome of the hyperthermophilic bacterium Aquifex 
aeolicus^=^ as a non-trivial example. For this 155 1335-letter sequence four avoided strings 
are identified at string length K = 7: 



B = {gcgcgcg,gcgcgca,cgcgcgc,tgcgcgc}. 



(9) 



Since there are significant overlaps among the avoided strings, the naive estimate of 
redundant avoided words can hardly work. To treat clusters of bad words we introduce a 
few notations. Suppose that there are two bad words u,v G B. Define 



Head[v] 

Tail[u] 
Overlap(u, v) 



{proper prefixes of v}, 
{proper suffixes of u}, 
Tail[u] fl Head[v\. 



Note that the definition of Overlap(u, v) is not symmetric. Take for example, u = gcgcgcg 
and v = gegegea, we have 

Head[u] = Head[v] = {g, gc, gcg, gege, gcgcg, gegege}, 
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Tail[u] 
Tail[v] 
Overlapiu, u) 
Overlapiu, v) 
Overlapiy, u) 
Overlapiy , v) 



{g, eg, gcg, cgcg, gcgcg, cgcgcg}, 
{a, ca, gca, cgea, gegea, cgegea}, 

{g, gcg, gcgcg}, 
{9, gcg, gcgcg}, 
{} = $, 
{} = $, 



where $ denotes an empty set. If v = xx' we write v/x = x'. Thus v/ gcg = cgca. The 
weight of Overlap(U, v) is denoted as 



[u : v) 



y^ wcight{y/x). 

x£Overlap(u,v) 



Using the two above u, v as example, we have 

{u : v ) = 



y^ weight(gcgcgcafx) 
%&{g, gcg, gcgcg} 
weight (eg cgca) + weight(cgca) + weight(ca) 



s 6 + s 4 + s 2 . 



In general, we may have B = {ui, U2, ■ ■ ■ ul}- A cluster C may contain a different bad 
word at the rightmost end. We write 



C 



Ui€B 



where C[u] is a cluster with u being the rightmost part. 

As C[v] may consist of either a single v or several entangled bad words, we again have 
a set-theoretical relation: 

C[v] <=>■ {v} U ue s C[u]Overlap(u, v). 

In terms of weight functions we have 

weight(C[v]) = —weight(v) — 2J ( u '■ v)weight(C[u]). 

u£B 

This is a system of L linear equations, L being the cardinality of the set B, i.e., L — \B\. 
The minus sign in the equation comes from the weight (—1)'°"' as \a\ = 1. 
In the case of Aquifex aeolicus L — 4, see (|9|). The Overlap matrix is: 



Overlapiui, Uj 
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We have further 



(Ui : Uj) 



p 


p 


q 

















Q 


q 


P 





q 


q 


P 






where 



P 

Q 



s z + s 4 + s c 
s + s 3 + s 5 . 



Therefore, the application of th Goulden- Jackson cluster method requires the solution 
of a system of four linear equations and leads to the following generating function: 



m 



1 + s z + s 4 + s° + s 8 + s lu + s 



,12 



1 - As + s 2 - As 3 + s 4 - 4s 5 + s 6 - 4s 8 - 4s 10 - 4s 12 ' 
The numbers of redundant avoided strings are given by: 
1 



4s 



f(s) = 4s 7 + 27s 8 + 152s 9 + 784s lu + 3840s 11 + 18176s 12 + 83968s" + 



(10) 



The coefficients coincide with the negative numbers in the last row of Table |5|. 

6 Language Theory Solution 

Language theory is not just a formal object. Properly applied to the right problem it 
may provide computational frameworks and useful constructions to yield quite practical 
results. We will make use of a special class of languages, namely, so-called factorizable 
language. However, we start with a brief summary of language theory in general. 



6.1 Elements of Language Theory 

One again begins with a finite alphabet, e.g., £ = {a, c, g, t} and collects all possible strings 
made of these letters into an infinite set £*, including the empty string e, i.e., a string 
that does not contain any letter. 

Any subset L e E* is said to be a language over the alphabet E. With such a general 
definition one cannot get very far. One has to specify how the subset L is formed. This 
may be done in many ways. For example, 

1. If the subset L is finite, one can simply enumerate its elements. 

2. One can devise some production rules and by applying these rules repetitively to 
some initial letters one generates the language. This is by far the most important 
and well-studied way of defining languages. If the rules are to be applied sequentially 
it leads to the generative grammar of N. Chomsky. If applied in parallel this leads to 
the Lindenmayer or L-systems. Referring the interested readers to [rj|] and literature 
cited therein, we will not go into details of these generative grammars. 

3. For a special class of languages, namely, the factorizable languages, one can define a 
language by indicating its set of forbidden words. This is the approach we are going 
to follow in this paper. 
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However, before turning to the factorizable language we formulate a few more notions 
which will be needed later. 

According to the Chomsky classification the simplest language is called regular lan- 
guage which may be accepted or recognized by a finite automaton without any memory. 
A finite automaton has a finite number of states and it makes transition from one state to 
another by looking at an input symbol and a table of transition rules. In fact, the table of 
rules defines a discrete transfer function. For finite automata the set of input symbols is 
also finite. There are two kinds of finite automata: deterministic and non-deterministic. 
In a deterministic automaton there is a starting state and the transition rule from one 
state to another upon seeing a certain input symbol is unique. In a non-deterministic 
automaton one has the freedom to choose the start state and to decide which rule to use 
at a transition as there might be more than one rule for one and the same input symbol. 
To avoid any confusion we emphasize that deterministic and non-deterministic automata 
are entirely equivalent in their capability to define a regular language. There may be 
more than one automata that define one and the same language. Among deterministic 
automata defining a language there is a minimal one, namely, one with a minimal number 
of states. This is called a minimal deterministic finite automaton of the language and is 
denoted as minD FA(L). 

To determine whether a language is regular or not, sometimes the following 

Equivalence Relation is quite helpful. Any language LeE* introduces an equiva- 
lence relation R^ in S* with respect to L: any two elements x, y G S* are equivalent and 
denoted as xRlV if and only if for every zeE* both xz and yz either belong to L or not 
belong to L. As usual, the index of Rl is the number of equivalence classes in E* with 
respect to L. An equivalence class may be represented by any element of that class, say, 
x G L, we will denote its equivalence class by [x]. 

So far we have used only general notions of language theory. The importance of the 
equivalence relation R L is due to the following 



Myhill-Nerode Theorem (see references in fl4||): 



1. The language L is regular if and only if the index of Rl is finite. 

2. The language L being regular implies that mmDFA(L) is unique up to an isomor- 
phism, namely, renaming of the states. 

3. The number of states of minDFA(L) is given by the index of R^. 

6.2 Factorizable Language 

Once a language L G S* has been defined, its complementary set V = X* — L contains all 
words that do not appear in L. A language L is called factorizable if any substring of a 
word x G L also belongs to L. In this case the complementary set V contains a minimal 
core L" such that although any word x G L" is forbidden in L, but any proper substring 
of x belongs to L. Sometime we simply call L" the set of forbidden words. It is nothing 
but what S. Wolfram called Distinct Excluded Blocks (DEBs) in the grammatical analysis 



of cellular automata^. Owing to the factorizability we can express the complementary 
set as V = X*L"E*. This means that L is entirely determined by the minimal set of 
forbidden words or DEBs. Written in set theory terms we have 

L = S* - £*L"E*. 



There are at least two important classes of factorizable language: dynamical language 
and the language defined by a complete genome. 

It is a natural consequence of dynamical evolution that symbolic sequences encoun- 
tered in symbolic dynamics of dynamical systems come under the definition of factorizable 
language, as any small part of a trajectory is also produced by the same dynamics. Fur- 
thermore, these languages are prolongable as one can always append at least one letter 
from the alphabet to make an admissible word longer. Factorizability and prolongability 



[1- 



together make the class of dynamical languagesii-3 . However, we will not make use of 
prolongability in the context of this work. 

A second class of factorizable language may be defined from a complete genome. 
Given a complete genome G of an organism, consisting of one or more linear or circular 
DNA sequences. One cuts the DNA sequences into all possible subsequences and forms a 
language L = sub(G) by collecting these subsequences, including the empty string. This 
language is factorizable by definition. It is almost prolongable if one does not extend 
it beyond the total length of the genome. The factorizability alone is enough for our 
purpose. 

6.3 Minimal Deterministic Automaton Accepting 

the Aquifex aeolicus Genome 

Now we show how language theory works on our familiar example of the Aquifex aeolicus 
complete genome. Although there are longer avoided strings we take the set B given by 
Eq. (H) to be its set L" of forbidden words for the time being. Since B is finite, the 
factorizable language defined by B is regular. In order to construct the automaton we 
have to know all the equivalence classes of S* with respect to L. We make use of the 
following mathematical result cj. 

Let L be a factorizable language and L" be its set of all DEBs. Define 

V = {v , v is a proper prefix of some y G L"}. 

Then for each word x G L there exists a string v G V such that is equivalent to x, or, 
in our notations, xRi,v. In other words, all equivalence classes of S* with respect to L 
are represented in the set V . Therefore, in order to find all equivalence classes of E* 
with respect to L it is enough to work with L" . We note in passing that [e] is always an 
equivalence class, and the complementary set V makes another equivalence class. 
From the proper suffixes of the avoided strings in B we get the set 

y = {g,gcigcg,gcgc,gcgcg,gcgcgc,c,cg,cgc,cgcg, 
cgcgc, cgcgcg, t, tg, tgc, tgcg, tgcgc, tgcgcg}. 

By checking the equivalence relations among these strings only 13 out of 18 are kept 
as representatives of each class. Adding the class [L'\ C X* we get the following 14 
equivalence classes of X*: 

[e] [g] [gc] [gcg] [gcgc] [gcgcg] [gcgcgc] 
[c] [eg] [cgc] [cgcg] [cgcgc] [cgcgcg] [L'\. 

We note that the task of "checking the equivalence relations" may seem formidable 
as the requirement "for every z G S*" concerns an infinite set. However, a little practice 
shows that this may be done effectively without too much work. 
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[e] 

[9] 

[gc] 

[gcg] 

[gcgc] 

[gcgcg] 

[gcgcgc] 

[c] 

[eg] 

[cgc] 

[cgcg] 

[cgege] 

[cgcgcg] 



a 

H 

[e] 

W\ 

[e] 

[e] 

[el 



[c] 

[go] 

[c] 

[gcgc] 

[c] 
[gcgcgc] 

[c] 
[c] 

[cgc] 

[c] 
[cgege] 

[c] 



[g] 

[g] 

[gcg] 

[g] 
[gcgcg] 

[g] 
W\ 
[eg] 
[g] 
[cgcg] 

[g] 

[cgcgcg] 

[g] 



Table 4: The transfer function for the minimal deterministic automaton for Aquifex ae- 
olicus. 

The transfer function is defined by 

5([xj], s) = [xis] for Xi G V and s G E. 

Therefore, our task is to attribute each [xis] to one of the existing equivalence classes. The 
discrete transfer function is listed in Table |j. The particular function relation #([xj], s) = 
[L 1 ] leads to a "dead end" . 

One can draw the minimal deterministic automaton according to the above transfer 
function. As it is no longer a planar graph we do not show it here. By counting the 
number of lines leading from one state to another, we write down an incidence matrix: 



M 



1 

1 1 
1 

1 1 
1 
1 1 



2 
1 

2 
1 
2 
1 
2 
2 
1 
2 
1 
2 
1 



The columns and rows of the matrix M are ordered as elements in the first column in 
Table |4] of the transfer function. 

To make connection with the generating function (0) we note that the characteristic 
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polynomial of M is related to /(1/A): 

det(XI-M) = \ 13 f(\). 

A 

Moreover, the sum of elements in the first row of the i^-th power of M is nothing but 

The summation runs over all equivalence classes except for V '. We list the elements of 
the first row of M K in columns of Table H. 



K 



10 



11 



1 


4 


16 


64 


256 


1024 


4095 


16378 


65501 


261960 


1047664 


1 


2 


8 


32 


128 


512 


2048 


8190 


32756 


131002 


523920 





1 


2 


8 


32 


128 


512 


2048 


8190 


32756 


131002 








1 


2 


8 


32 


128 


512 


2048 


8190 


32756 











1 


2 


8 


32 


128 


512 


2048 


8190 














1 


2 


8 


32 


128 


512 


2048 

















1 


2 


8 


32 


128 


512 


2 


7 


28 


112 


448 


1792 


7168 


28665 


114640 


458483 


1833624 





2 


7 


28 


112 


448 


1792 


7168 


28665 


114640 


458483 








2 


7 


28 


112 


448 


1792 


7168 


28665 


114640 











2 


7 


28 


112 


448 


1792 


7168 


28665 














2 


7 


28 


112 


448 


1792 


7168 

















2 


7 


28 


112 


448 


1792 


Sum: 4 


16 


64 


256 


1024 


4096 


16380 


65509 


261992 


1047792 


4190464 














-4 


-27 


-152 


-784 


-3840 



Table 5: Elements of the first rows of Mk (shown as columns) and their sum. The negative 
numbers in the last row are the difference between ax and 4 K . 



The negative numbers in the last row of Table [| show the difference between ax and 
4 K . They are precisely the coefficients in the expansion (|T0|) of 1/(1 — 4s) — f(s), shown at 
the end of Section |5.2| . We see that the transfer function and the incidence matrix contain 
more detailed information on the combinatorial problem than the generating function 
alone. The implication of this approach needs to be further elucidated. 

In order to avoid any confusion we emphasize that the minimal deterministic automa- 
ton defined by the transfer function given in Table § accepts a regular language determined 
by the set B of four forbidden words. This language is larger than the language sub(G) 
obtained from the complete genome of Aquifex aeolicus. By including more and more 
avoided strings into the set B the minimal automaton gets larger but the language it 
accepts approaches sub{G) gradually. However, the calculation becomes tedious. 
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